External uniform electric field removing flexoelectric effect in epitaxial ferroelectric thin films
Using the modified Landau-Ginzburg-Devonshire thermodynamic theory, it is found that the coupling between stress gradient and polarization, i.e. flexoelectricity, has a significant effect on the ferroelectric properties of epitaxial thin films, such as the polarization, free-energy profile, and hysteresis loop. However, this effect can be completely eliminated by applying an optimized external uniform electric field. The role of such a uniform electric field is shown to be the same as that of an ideal gradient electric field, which, according to the present theory, can completely suppress the flexoelectric effect. Since a uniform electric field is more convenient to apply and control than a gradient electric field, it can potentially be used to remove the flexoelectric effect induced by stress gradients in epitaxial thin films and to enhance their ferroelectric properties.
Comment: 5 pages, 3 figures
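The mechanism can be sketched with a schematic one-dimensional LGD free energy; the symbols and coupling form below are illustrative, not the paper's exact modified potential:

```latex
F(P) = \alpha P^{2} + \beta P^{4} + \gamma P^{6}
       - f \,\frac{\partial \sigma}{\partial z}\, P - E P
```

In this sketch the flexoelectric coupling acts on $P$ exactly like an effective field $E_{\mathrm{flexo}} = f\,\partial\sigma/\partial z$, so choosing the uniform external field as $E = -\langle E_{\mathrm{flexo}}\rangle$ (a thickness-averaged value here) cancels its net contribution, which is the intuition behind replacing the ideal gradient field with an optimized uniform one.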
VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer
The end-to-end singing voice synthesis (SVS) model VISinger achieves better performance than typical two-stage models with fewer parameters. However, VISinger has several problems: the text-to-phase problem, where the end-to-end model learns a meaningless text-to-phase mapping; the glitch problem, where the harmonic components corresponding to the periodic signal of voiced segments undergo sudden changes with audible artefacts; and the low-sampling-rate problem, where the 24 kHz sampling rate does not meet the needs of high-fidelity, full-band generation (44.1 kHz or higher). In this paper, we propose VISinger 2, which addresses these issues by integrating digital signal processing (DSP) methods into VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer, which generate the periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract a latent representation free of phase information, preventing the prior encoder from modelling the text-to-phase mapping. To avoid glitch artefacts, HiFi-GAN is modified to accept the waveforms generated by the DSP synthesizer as a condition when producing the singing voice. Moreover, with the improved waveform decoder, VISinger 2 generates 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics.
Comment: Submitted to ICASSP 202
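A DDSP-style harmonic-plus-noise synthesizer of the kind described can be sketched in NumPy. This is a toy version: the real synthesizer predicts time-varying harmonic amplitudes and noise filters from the latent representation z, whereas the amplitudes and filter here are fixed illustrative values.

```python
import numpy as np

def harmonic_synth(f0, amps, sr=44100):
    """Periodic part: sum of sinusoids at integer multiples of a (possibly time-varying) f0."""
    phase = 2 * np.pi * np.cumsum(f0) / sr   # cumulative phase handles a varying f0 track
    audio = np.zeros(len(f0))
    for k, a in enumerate(amps, start=1):
        audio += a * np.sin(k * phase)
    return audio

def noise_synth(n, seed=0):
    """Aperiodic part: white noise shaped by a crude moving-average low-pass filter."""
    noise = np.random.default_rng(seed).standard_normal(n)
    kernel = np.ones(8) / 8.0
    return np.convolve(noise, kernel, mode="same")

sr = 44100
f0 = np.full(sr // 10, 220.0)                 # 100 ms of a constant 220 Hz pitch track
periodic = harmonic_synth(f0, amps=[0.6, 0.3, 0.1], sr=sr)
aperiodic = 0.05 * noise_synth(len(f0))
y = periodic + aperiodic                      # waveform that conditions the modified decoder
```

In VISinger 2 a waveform like `y` is what the modified HiFi-GAN receives as a condition, which is what suppresses the glitch artefacts.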
Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling
This paper aims to synthesize a target speaker's speech with a desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. Specifically, we address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt a multi-label binary vector (MBV) and mutual information (MI) minimization to discretize the extracted embeddings and disentangle these highly entangled factors, respectively, in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labelled data, style-labelled data, and unlabeled data. To better transfer fine-grained expressiveness from the references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the effectiveness of our model design.
Comment: Submitted to ICASSP 202
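The MBV discretization step can be sketched as simple thresholding. This shows only the forward pass; in training, a model like this would need a gradient estimator (e.g. straight-through) for the non-differentiable threshold, which this sketch omits.

```python
import numpy as np

def to_mbv(embedding, threshold=0.0):
    """Discretize a continuous embedding into a multi-label binary vector (forward pass only)."""
    return (embedding > threshold).astype(np.float32)

# hypothetical 4-dimensional style embedding
e_style = np.array([0.8, -1.2, 0.3, -0.1])
mbv = to_mbv(e_style)   # -> [1., 0., 1., 0.]
```

Forcing the embedding through a binary bottleneck like this limits its capacity, which is one way to keep speaker timbre from leaking into the style and emotion codes.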
PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions
Style transfer TTS has shown impressive performance in recent years. However, style control is often restricted to systems built on expressive speech recordings with discrete style categories. In practical situations, users may want to transfer a style by typing a text description of the desired style, without any reference speech in the target style. Text-guided content generation techniques have drawn wide attention recently. In this work, we explore the possibility of controllable style transfer with natural language descriptions. To this end, we propose PromptStyle, a text-prompt-guided cross-speaker style transfer system. Specifically, PromptStyle consists of an improved VITS and a cross-modal style encoder. The cross-modal style encoder constructs a shared space of stylistic and semantic representations through a two-stage training process. Experiments show that PromptStyle achieves proper style transfer with text prompts while maintaining relatively high stability and speaker similarity. Audio samples are available on our demo page.
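The abstract does not state how the shared space is learned; one plausible sketch is a symmetric contrastive (InfoNCE-style) objective that pulls paired prompt-text and style embeddings together while pushing mismatched pairs apart. The loss below is an illustrative assumption, not PromptStyle's documented training objective.

```python
import numpy as np

def info_nce(text_emb, style_emb, tau=0.1):
    """Symmetric contrastive loss: paired (prompt, style) embeddings attract, others repel."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = style_emb / np.linalg.norm(style_emb, axis=1, keepdims=True)
    logits = t @ s.T / tau                    # (B, B) scaled cosine-similarity matrix
    idx = np.arange(len(t))

    def xent(lg):                             # cross-entropy with matched pairs on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the loss approaches zero; with shuffled pairings it stays large, which is the signal that shapes the shared space.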
AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
Speaker adaptation in text-to-speech synthesis (TTS) fine-tunes a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been devoted to this task, little work has targeted low-computing-resource scenarios, owing to the challenge of building a lightweight model with low computational complexity. In this paper, we propose AdaVITS, a tiny VITS-based TTS model for speaker adaptation under low computing resources. To effectively reduce the parameters and computational complexity of VITS, an iSTFT-based waveform construction decoder is proposed to replace the resource-consuming upsampling-based decoder in the original VITS. Besides, NanoFlow is introduced to share the density estimation across flow blocks, reducing the parameters of the prior encoder. Furthermore, to reduce the computational complexity of the text encoder, scaled dot-product attention is replaced with linear attention. To deal with the instability caused by the simplified model, instead of the original text-encoder input, a phonetic posteriorgram (PPG) produced by a text-to-PPG module is used as the linguistic feature and fed to the encoder. Experiments show that AdaVITS generates stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFLOPs of computational complexity.
Comment: Accepted by ISCSLP 202
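The linear-attention substitution can be sketched as follows. The elu(x)+1 feature map is the common choice from the linear-transformer literature and is an assumption here, since the abstract does not specify the kernel; the point is that the (key, value) summary is computed once, giving O(N) rather than O(N²) cost in sequence length.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention: phi(Q) @ (phi(K)^T V), with phi(x) = elu(x) + 1 (always positive)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                       # (d, d_v): keys/values summarized once
    z = Qf @ Kf.sum(axis=0)             # (N,): per-query normalizer
    return (Qf @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 16))
K = rng.standard_normal((128, 16))
V = rng.standard_normal((128, 32))
out = linear_attention(Q, K, V)
```

Swapping this in for scaled dot-product attention leaves the encoder interface unchanged while cutting the attention cost, which is what matters on low-resource devices.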
DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP
Recent developments in neural vocoders based on generative adversarial networks (GANs) have shown their advantages in generating raw waveforms conditioned on mel-spectrograms, with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in diverse scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency-domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we use the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the text-to-speech (TTS) acoustic model, as the time-frequency-domain supervision for the GAN-based vocoder. We also utilize sine excitation as time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech from diverse TTS data.
Comment: Submitted to ICASSP 202
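Sine excitation of the kind used as time-domain supervision can be sketched as below, in the spirit of neural-source-filter vocoders: a sinusoid follows the f0 track in voiced regions, while unvoiced regions (f0 = 0) carry only low-level noise. The amplitude, noise level, and sample-level f0 interface are illustrative assumptions.

```python
import numpy as np

def sine_excitation(f0, sr=24000, amp=0.1, noise_std=0.003, seed=0):
    """Sine excitation from a sample-level f0 track; unvoiced samples (f0 == 0) get noise only."""
    rng = np.random.default_rng(seed)
    phase = 2 * np.pi * np.cumsum(f0) / sr       # cumulative phase tracks the varying f0
    voiced = (f0 > 0).astype(float)              # voiced/unvoiced mask
    return amp * voiced * np.sin(phase) + noise_std * rng.standard_normal(len(f0))

# 100 ms voiced at 200 Hz followed by 100 ms unvoiced
f0 = np.concatenate([np.full(2400, 200.0), np.zeros(2400)])
e = sine_excitation(f0)
```

Supervising the vocoder with a harmonically clean signal like `e` anchors the phase and harmonic structure of the generated waveform, which is how the sine excitation helps suppress GAN artifacts.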
Energy-Saving and Low-Carbon Gear Blank Dimension Design Based on Business Compass
Sustainable blank dimension design is key to implementing green industrial development. However, conventional blank dimension design considers only the production factors of the design stage, and therefore cannot guarantee the overall goals of the blank production stage and the use stage. In this paper, guided by the business compass framework, a low-carbon, low-energy-consumption blank dimension optimization design model is proposed. Taking the process parameters of blank production and blank use as the variables, the grey wolf optimization algorithm is adopted to solve the model. Taking a gear blank dimension as an example, the optimized blank dimension is 98.6; compared with the standard blank dimensions of 100 and 105, the energy consumption is 95.7% and 93.1%, the carbon emission is 92.6% and 90.2%, and the material consumption is 96.5% and 87.5% of the standard values, respectively. The sustainable blank dimension design has clear advantages in energy consumption and carbon emissions, saves substantial material, and promotes product sustainability.
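The optimization step can be illustrated with a minimal grey wolf optimizer (GWO) sketch. The cost function below is a stand-in quadratic with its minimum placed at the reported optimal dimension 98.6, not the paper's actual energy/carbon/material model, and the pack size and iteration count are arbitrary choices.

```python
import numpy as np

def gwo(f, dim, lb, ub, n_wolves=20, iters=200, seed=0):
    """Grey wolf optimizer: the three best wolves (alpha, beta, delta) guide the pack."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_wolves, dim))
    best_x, best_f = None, np.inf
    for t in range(iters):
        fit = np.apply_along_axis(f, 1, X)
        order = np.argsort(fit)
        if fit[order[0]] < best_f:                 # track the best solution found so far
            best_f, best_x = fit[order[0]], X[order[0]].copy()
        alpha, beta, delta = X[order[0]], X[order[1]], X[order[2]]
        a = 2.0 * (1 - t / iters)                  # shrinks: exploration -> exploitation
        candidates = []
        for leader in (alpha, beta, delta):
            r1 = rng.random((n_wolves, dim))
            r2 = rng.random((n_wolves, dim))
            A, C = 2 * a * r1 - a, 2 * r2
            D = np.abs(C * leader - X)             # encircling distance to the leader
            candidates.append(leader - A * D)
        X = np.clip(np.mean(candidates, axis=0), lb, ub)
    return best_x, best_f

# stand-in cost: quadratic with minimum at the hypothetical optimal dimension 98.6
best_x, best_f = gwo(lambda x: (x[0] - 98.6) ** 2, dim=1, lb=90.0, ub=110.0)
```

In the paper's setting, `f` would instead evaluate the combined energy, carbon, and material cost of a candidate blank dimension across the production and use stages.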